Incremental Training and Intentional Over-fitting of Word Alignment

Authors

  • Qin Gao
  • Will Lewis
  • Chris Quirk
  • Mei-Yuh Hwang
Abstract

We investigate two problems in word alignment for machine translation. First, we compare methods for incremental word alignment, which save time in large-scale machine translation systems. We compare various ways of using existing word alignment models, trained on a larger general corpus, to incrementally align smaller new corpora. In addition, by training separate translation tables, we eliminate the need to re-process the baseline data. Experimental results are comparable or even superior to baseline batch-mode training. Building on this success, we explore the possibility of sharpening the alignment model via an incremental training scheme. By first training a general word alignment model on the whole corpus, then dividing the same corpus into domain-specific partitions and applying incremental training to each partition, we improve machine translation quality as measured by BLEU.
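The two-stage scheme described in the abstract can be illustrated with a minimal sketch. It assumes a toy IBM Model 1-style lexical table trained with EM, without a NULL word, and a warm start from a previously trained table; the abstract does not specify the alignment models or toolkit used, so every name here (`train_lexical_table`, `align`, `FLOOR`, the three-sentence corpus) is hypothetical and stands in for the real system.

```python
from collections import defaultdict

FLOOR = 1e-3  # assumed starting value for word pairs missing from the warm-start table


def train_lexical_table(pairs, iterations=5, init_t=None):
    """EM for a Model 1-style table t[(f, e)] = p(f | e); no NULL word, for brevity.

    pairs: list of (source_tokens, target_tokens).
    init_t: optional table from an earlier run, used as the starting point.
    """
    t = defaultdict(lambda: FLOOR)
    if init_t:
        t.update(init_t)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: spread each source word over the target words of its sentence.
        for src, tgt in pairs:
            for f in src:
                norm = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize so that p(. | e) sums to 1 for every target word e.
        t = defaultdict(lambda: FLOOR,
                        {fe: c / total[fe[1]] for fe, c in count.items()})
    return dict(t)


def align(src, tgt, t):
    """Link each source position to its most probable target position."""
    links = []
    for i, f in enumerate(src):
        best_j, best_p = None, 0.0
        for j, e in enumerate(tgt):
            p = t.get((f, e), 0.0)
            if p > best_p:
                best_j, best_p = j, p
        if best_j is not None:
            links.append((i, best_j))
    return links


# Toy data: stage 1 trains on the whole corpus, stage 2 continues on one partition.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
general = train_lexical_table(corpus, iterations=10)
partition = corpus[:2]  # stands in for a domain-specific subset of the same corpus
sharpened = train_lexical_table(partition, iterations=3, init_t=general)
links = align("das haus".split(), "the house".split(), sharpened)
```

Because the second stage re-estimates the table from the partition alone, the resulting table keeps only word pairs observed in that partition. This loosely mirrors the intentional over-fitting named in the title and the idea of separate translation tables, under the simplifying assumptions above.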

Similar resources

Length-Incremental Phrase Training for SMT

We present an iterative technique to generate phrase tables for SMT, which is based on force-aligning the training data with a modified translation decoder. Different from previous work, we completely avoid the use of a word alignment or phrase extraction heuristics, moving towards a more principled phrase generation and probability estimation. During training, we allow the decoder to generate ...

DCU-Symantec at the WMT 2013 Quality Estimation Shared Task

We describe the two systems submitted by the DCU-Symantec team to Task 1.1 of the WMT 2013 Shared Task on Quality Estimation for Machine Translation. Task 1.1 involves estimating post-editing effort for English-Spanish translation pairs in the news domain. The two systems use a wide variety of features, of which the most effective are the word-alignment, n-gram frequency, language model, POS-tag...

Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm

In this paper we present a non-language-specific strategy that uses large amounts of monolingual data to improve statistical machine translation (SMT) when only a small parallel training corpus is available. This strategy uses word classes derived from monolingual text data to improve the word alignment quality, which generally deteriorates significantly because of insufficient training. We pres...

A Maximum Entropy Word Aligner for Arabic-English Machine Translation

This paper presents a maximum entropy word alignment algorithm for Arabic-English based on supervised training data. We demonstrate that it is feasible to create training material for problems in machine translation and that a mixture of supervised and unsupervised methods yields superior performance. The probabilistic model used in the alignment directly models the link decisions. Significant i...

Refining Kazakh Word Alignment Using Simulation Modeling Methods for Statistical Machine Translation

Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at the phrase level after the collection of sentences from the Kazakh-English parallel corpora. The estimation technique extracts the phrase pairs from the word alignment and then incorporates them into the translation system for further steps. Although ...

Publication date: 2011